Computation and Language
☆ Model-tuning Via Prompts Makes NLP Models Adversarially Robust
In recent years, NLP practitioners have converged on the following practice:
(i) import an off-the-shelf pretrained (masked) language model; (ii) append a
multilayer perceptron atop the CLS token's hidden representation (with randomly
initialized weights); and (iii) fine-tune the entire model on a downstream task
(MLP). This procedure has produced massive gains on standard NLP benchmarks,
but these models remain brittle, even to mild adversarial perturbations, such
as word-level synonym substitutions. In this work, we demonstrate surprising
gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP), an
alternative method of adapting to downstream tasks. Rather than modifying the
model (by appending an MLP head), MVP instead modifies the input (by appending
a prompt template). Across three classification datasets, MVP improves
performance against adversarial word-level synonym substitutions by an average
of 8% over standard methods and even outperforms adversarial training-based
state-of-the-art defenses by 3.5%. By combining MVP with adversarial training, we
achieve further improvements in robust accuracy while maintaining clean
accuracy. Finally, we conduct ablations to investigate the mechanism underlying
these gains. Notably, we find that the vulnerability of MLP can be attributed
mainly to the misalignment between pre-training and fine-tuning tasks, and to
the randomly initialized MLP parameters. Code is available at
https://github.com/acmi-lab/mvp
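The contrast between the two adaptation strategies can be sketched in a few lines. The following toy illustration is a sketch under stated assumptions (the verbalizer token ids, prompt wording, and array shapes are invented for illustration, not taken from the paper): MVP-style scoring reuses the pretrained MLM head's logits at a [MASK] position, instead of introducing a randomly initialized classification head.

```python
import numpy as np

def mlp_head_scores(cls_hidden, W, b):
    # Standard MLP approach: a freshly initialized linear head over the
    # [CLS] representation; W and b carry no pretraining signal.
    return cls_hidden @ W + b

def mvp_scores(mask_logits, verbalizer_ids):
    # MVP-style approach: append a prompt such as "It was [MASK]." and
    # score each class by the pretrained MLM logit of its verbalizer
    # token at the [MASK] position; no new parameters are introduced.
    return mask_logits[verbalizer_ids]

rng = np.random.default_rng(0)
mask_logits = rng.normal(size=100)      # stand-in for MLM output at [MASK]
verbalizer_ids = np.array([7, 42])      # hypothetical ids of "terrible"/"great"
print(mvp_scores(mask_logits, verbalizer_ids))
```

Because the class scores come from logits the model already produces, no weights need to be initialized from scratch, which is one of the two vulnerability sources the ablations point to.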
☆ Meet in the Middle: A New Pre-training Paradigm
Most language models (LMs) are trained and applied in an autoregressive
left-to-right fashion, assuming that the next token only depends on the
preceding ones. However, this assumption ignores the potential benefits of
using the full sequence information during training, and the possibility of
having context from both sides during inference. In this paper, we propose a
new pre-training paradigm with techniques that jointly improve the training
data efficiency and the capabilities of the LMs in the infilling task. The
first is a training objective that aligns the predictions of a left-to-right LM
with those of a right-to-left LM, trained on the same data but in reverse
order. The second is a bidirectional inference procedure that enables both LMs
to meet in the middle. We show the effectiveness of our pre-training paradigm
with extensive experiments on both programming and natural language models,
outperforming strong baselines.
comment: 24 pages, 2 figures
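The first technique, aligning a left-to-right LM with a right-to-left LM trained on the same data, can be sketched as an agreement regularizer over per-position predictive distributions. This is a minimal numpy sketch assuming a symmetric-KL agreement term; the paper's exact formulation may differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def agreement_loss(fwd_logits, bwd_logits):
    # For each position, push the left-to-right and right-to-left
    # predictive distributions for the same token toward each other
    # (symmetric KL divergence, averaged over positions).
    p, q = softmax(fwd_logits), softmax(bwd_logits)
    kl_pq = (p * (np.log(p) - np.log(q))).sum(-1)
    kl_qp = (q * (np.log(q) - np.log(p))).sum(-1)
    return 0.5 * (kl_pq + kl_qp).mean()

rng = np.random.default_rng(0)
fwd = rng.normal(size=(5, 50))  # (positions, vocab) logits from the L2R LM
bwd = rng.normal(size=(5, 50))  # logits from the R2L LM at the same positions
print(agreement_loss(fwd, bwd))  # positive while the two LMs disagree
```

Adding such a term to the usual next-token losses of both models is one way the two directions can be trained to produce compatible predictions before being combined at inference.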
☆ Transformer-based approaches to Sentiment Detection
Olumide Ebenezer Ojo, Hoang Thang Ta, Alexander Gelbukh, Hiram Calvo, Olaronke Oluwayemisi Adebanji, Grigori Sidorov
The use of transfer learning methods is largely responsible for the present
breakthrough in Natural Language Processing (NLP) tasks across multiple
domains. To solve the problem of sentiment detection, we examined the
performance of four well-known state-of-the-art transformer models for text
classification: Bidirectional Encoder Representations from Transformers (BERT),
the Robustly Optimized BERT Pre-training Approach (RoBERTa), a distilled
version of BERT (DistilBERT), and a large bidirectional neural network
architecture (XLNet). We compared the performance of these four models on the
task of detecting disasters in text. All the models performed well, indicating
that transformer-based models are suitable for detecting disasters in text.
The RoBERTa transformer model performs best on the test dataset with a score of
82.6% and is highly recommended for quality predictions. Furthermore, we
discovered that the learning algorithms' performance was influenced by the
pre-processing techniques, the nature of words in the vocabulary, unbalanced
labeling, and the model parameters.
comment: Publisher: Springer Nature Switzerland AG, Gewerbestrasse 11, 6330
Cham, Switzerland Published in Book Titled: Recent Developments and the New
Directions of Research, Foundations, and Applications: Selected Papers of the
8th World Conference on Soft Computing, February 03-05, 2022, Baku,
Azerbaijan
☆ Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images
Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, Roy Schwartz
Weird, unusual, and uncanny images pique the curiosity of observers because
they challenge common sense. For example, an image released during the 2022
world cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo
playing chess, which playfully violates our expectation that their competition
should occur on the football field. Humans can easily recognize and interpret
these unconventional images, but can AI models do the same? We introduce
WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset
comprises purposefully commonsense-defying images created by designers using
publicly-available image generation tools like Midjourney. We consider several
tasks posed over the dataset. In addition to image captioning, cross-modal
matching, and visual question answering, we introduce a difficult explanation
generation task, where models must identify and explain why a given image is
unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2
still lag behind human performance on WHOOPS!. We hope our dataset will inspire
the development of AI models with stronger visual commonsense reasoning
abilities. Data, models and code are available at the project website:
whoops-benchmark.github.io
☆ Are Models Trained on Indian Legal Data Fair?
Sahil Girhepuje, Anmol Goel, Gokul Krishnan, Shreya Goyal, Satyendra Pandey, Ponnurangam Kumaraguru, Balaram Ravindran
Recent advances and applications of language technology and artificial
intelligence have enabled much success across multiple domains such as law,
medicine, and mental health. AI-based language models for tasks like judgement
prediction have recently been proposed for the legal sector. However, these
models are rife with encoded social biases picked up from the training data.
While bias and fairness have been studied across NLP, most studies primarily
locate themselves within a Western context. In this work, we present an initial
investigation of fairness from the Indian perspective in the legal domain. We
highlight the propagation of learnt algorithmic biases in the bail prediction
task for models trained on Hindi legal documents. We evaluate the fairness gap
using demographic parity and show that a decision tree model trained for the
bail prediction task has an overall fairness disparity of 0.237 between input
features associated with Hindus and Muslims. Additionally, we highlight the
need for further research and studies in the avenues of fairness/bias in
applying AI in the legal sector with a specific focus on the Indian context.
comment: Presented at the Symposium on AI and Law (SAIL) 2023
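The demographic parity evaluation used above has a standard definition: compare the rate of positive predictions across groups and report the gap between those rates. A minimal sketch on hypothetical data (the predictions and group labels below are invented for illustration, not from the paper's Hindi legal corpus):

```python
def demographic_parity_gap(preds, groups, positive=1):
    # Demographic parity compares the positive-prediction rate across
    # groups; the gap is the difference between the extreme rates.
    rates = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        rates[g] = sum(preds[i] == positive for i in idx) / len(idx)
    vals = list(rates.values())
    return max(vals) - min(vals), rates

# Hypothetical bail predictions (1 = bail granted) for two groups:
preds  = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
gap, rates = demographic_parity_gap(preds, groups)
print(gap, rates)   # gap = 0.5: group A at 0.75, group B at 0.25
```

The paper's reported disparity of 0.237 is exactly this kind of gap, computed between input features associated with Hindus and Muslims.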
☆ PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents
Foundation models trained on large-scale datasets have recently surged in CV
and NLP. In contrast, development in the biomedical domain lags far behind due
to data scarcity. To address this issue, we build and release PMC-OA, a
biomedical dataset with 1.6M image-caption pairs collected from PubMedCentral's
OpenAccess subset, which is 8 times larger than before. PMC-OA covers diverse
modalities and diseases, with the majority of the image-caption samples aligned
at a finer-grained level, i.e., subfigure and subcaption. By pretraining a
CLIP-style model on PMC-OA, our model, PMC-CLIP, achieves state-of-the-art
results on various downstream tasks, including image-text retrieval on ROCO,
MedMNIST image classification, and Medical VQA, e.g., +8.1% R@10 on image-text
retrieval and +3.9% accuracy on image classification.
comment: 10 pages, 3 figures
☆ Scaling Vision-Language Models with Sparse Mixture of Experts
The field of natural language processing (NLP) has made significant strides
in recent years, particularly in the development of large-scale vision-language
models (VLMs). These models aim to bridge the gap between text and visual
information, enabling a more comprehensive understanding of multimedia data.
However, as these models become larger and more complex, they also become more
challenging to train and deploy. One approach to addressing this challenge is
the use of sparsely-gated mixture-of-experts (MoE) techniques, which divide the
model into smaller, specialized sub-models that can jointly solve a task. In
this paper, we explore the effectiveness of MoE in scaling vision-language
models, demonstrating its potential to achieve state-of-the-art performance on
a range of benchmarks over dense models of equivalent computational cost. Our
research offers valuable insights into stabilizing the training of MoE models,
understanding the impact of MoE on model interpretability, and balancing the
trade-offs between compute and performance when scaling VLMs. We hope our work will
inspire further research into the use of MoE for scaling large-scale
vision-language models and other multimodal machine learning applications.
comment: Preprint
☆ A Comprehensive Empirical Evaluation of Existing Word Embedding Approaches
Vector-based word representations help countless Natural Language Processing
(NLP) tasks capture both semantic and syntactic regularities of the language.
In this paper, we present the characteristics of existing word embedding
approaches and analyze them with regard to many classification tasks. We
categorize the methods into two main groups: traditional approaches, which
mostly use matrix factorization to produce word representations and are not
able to capture the semantic and syntactic regularities of the language very
well, and neural-network-based approaches, which can capture sophisticated
regularities of the language and preserve the word relationships in the
generated word representations. We report experimental results on multiple
classification tasks and highlight the scenarios where one approach performs
better than the rest.
comment: 28 pages
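The matrix-factorization route taken by the traditional approaches can be shown in a few lines: build a word-word co-occurrence matrix from a corpus and factorize it with truncated SVD to obtain dense word vectors. This is a generic sketch of that family of methods (the toy corpus and dimensionality are invented), not a reproduction of any specific approach surveyed in the paper.

```python
import numpy as np

# Toy corpus and vocabulary.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric word-word co-occurrence counts within each sentence.
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j, c in enumerate(words):
            if i != j:
                C[idx[w], idx[c]] += 1

# Truncated SVD: keep the top-k singular directions as word vectors.
U, S, Vt = np.linalg.svd(C)
k = 2
embeddings = U[:, :k] * S[:k]   # k-dimensional word vectors
print({w: embeddings[idx[w]].round(2) for w in vocab})
```

Neural approaches instead learn the vectors by gradient descent on a prediction objective, which is what lets them capture the more sophisticated regularities the abstract refers to.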
☆ NeuroQL: A Neuro-Symbolic Language and Dataset for Inter-Subjective Reasoning
We present a new AI task and baseline solution for Inter-Subjective
Reasoning. We define inter-subjective information to be a mixture of objective
and subjective information possibly shared by different parties. Examples may
include commodities and their objective properties as reported by IR
(Information Retrieval) systems, which need to be cross-referenced with
subjective user reviews from an online forum. For an AI system to successfully
reason about both, it needs to be able to combine symbolic reasoning over
objective facts with the shared consensus found in subjective user reviews. To
this end we introduce the NeuroQL dataset and DSL (Domain-specific Language) as
a baseline solution for this problem. NeuroQL is a neuro-symbolic language that
extends logical unification with neural primitives for extraction and
retrieval. It can function as a target for automatic translation of
inter-subjective questions (posed in natural language) into the neuro-symbolic
code that can answer them.
comment: 18 pages, 6 figures
☆ Large Language Models in the Workplace: A Case Study on Prompt Engineering for Job Type Classification
This case study investigates the task of job classification in a real-world
setting, where the goal is to determine whether an English-language job posting
is appropriate for a graduate or entry-level position. We explore multiple
approaches to text classification, including supervised approaches such as
traditional models like Support Vector Machines (SVMs) and state-of-the-art
deep learning methods such as DeBERTa. We compare them with Large Language
Models (LLMs) used in both few-shot and zero-shot classification settings. To
accomplish this task, we employ prompt engineering, a technique that involves
designing prompts to guide the LLMs towards the desired output. Specifically,
we evaluate the performance of two commercially available state-of-the-art
GPT-3.5-based language models, text-davinci-003 and gpt-3.5-turbo. We also
conduct a detailed analysis of the impact of different aspects of prompt
engineering on the model's performance. Our results show that, with a
well-designed prompt, a zero-shot gpt-3.5-turbo classifier outperforms all
other models, achieving a 6% increase in Precision@95% Recall compared to the
best supervised approach. Furthermore, we observe that the wording of the
prompt is a critical factor in eliciting the appropriate "reasoning" in the
model, and that seemingly minor aspects of the prompt significantly affect the
model's performance.
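A zero-shot prompt for this task is just a carefully worded instruction wrapped around the posting text. The sketch below is illustrative only: the wording, label names, and output format are assumptions for the sake of the example, not the authors' actual prompt.

```python
def build_prompt(posting):
    # Illustrative zero-shot classification prompt: instructions first,
    # then the posting, then a cue for a one-word answer.
    instructions = (
        "You are classifying job postings. Answer with exactly one word:\n"
        "'graduate' if the posting is appropriate for a graduate or\n"
        "entry-level candidate, otherwise 'experienced'.\n\n"
    )
    return instructions + "Job posting:\n" + posting + "\n\nAnswer:"

prompt = build_prompt("Junior data analyst, no prior experience required.")
print(prompt)
```

The paper's finding that seemingly minor wording changes significantly affect performance is precisely about variations in a template like this one.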
☆ Generating multiple-choice questions for medical question answering with distractors and cue-masking
Medical multiple-choice question answering (MCQA) is particularly difficult.
Questions may describe patient symptoms and ask for the correct diagnosis,
which requires domain knowledge and complex reasoning. Standard language
modeling pretraining alone is not sufficient to achieve the best results.
Jin et al. (2020) showed that focusing masked language modeling on disease name
prediction when using medical encyclopedic paragraphs as input leads to
considerable MCQA accuracy improvement. In this work, we show that (1)
fine-tuning on a generated MCQA dataset outperforms the masked language
modeling based objective and (2) correctly masking the cues to the answers is
critical for good performance. We release new pretraining datasets and achieve
state-of-the-art results on 4 MCQA datasets, notably +5.7% with a base-size
model on MedQA-USMLE.
☆ Addressing Biases in the Texts using an End-to-End Pipeline Approach ECIR 2023
The concept of fairness is gaining popularity in academia and industry.
Social media is especially vulnerable to media biases and toxic language and
comments. We propose a fair ML pipeline that takes a text as input and
determines whether it contains biases and toxic content. Then, based on
pre-trained word embeddings, it suggests a set of new words by substituting the
biased words; the idea is to lessen the effects of those biases by replacing
them with alternative words. We compare our approach to existing fairness
models to determine its effectiveness. The results show that our proposed
pipeline can detect, identify, and mitigate biases in social media data.
comment: Accepted in Bias @ ECIR 2023
☆ A Human Subject Study of Named Entity Recognition (NER) in Conversational Music Recommendation Queries
We conducted a human subject study of named entity recognition on a noisy
corpus of conversational music recommendation queries, with many irregular and
novel named entities. We evaluated the human NER linguistic behaviour in these
challenging conditions and compared it with the most common NER systems
nowadays, fine-tuned transformers. Our goal was to learn about the task to
guide the design of better evaluation methods and NER algorithms. The results
showed that NER in our context was quite hard for both humans and algorithms
under a strict evaluation schema; humans had higher precision, while the models
had higher recall because of entity exposure, especially during pre-training;
and entity types had different error patterns (e.g. frequent typing errors for
artists). The released corpus goes beyond predefined frames of interaction and
can support future work in conversational music recommendation.
☆ Contextually-rich human affect perception using multimodal scene information ICASSP
The process of human affect understanding involves the ability to infer
person-specific emotional states from various sources including images, speech,
and language. Affect perception from images has predominantly focused on
expressions extracted from salient face crops. However, emotions perceived by
humans rely on multiple contextual cues including social settings, foreground
interactions, and ambient visual scenes. In this work, we leverage pretrained
vision-language (VLM) models to extract descriptions of foreground context from
images. Further, we propose a multimodal context fusion (MCF) module to combine
foreground cues with the visual scene and person-based contextual information
for emotion prediction. We show the effectiveness of our proposed modular
design on two datasets associated with natural scenes and TV shows.
comment: Accepted to IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), 2023
☆ The System Description of dun_oscar team for The ICPR MSR Challenge
This paper introduces the system submitted by the dun_oscar team for the ICPR
MSR Challenge. Three subsystems for task1-task3 are described respectively. In
task1, we develop a visual system which includes an OCR model, a text tracker,
and an NLP classifier for distinguishing subtitles and non-subtitles. In task2,
we employ an ASR system which includes an AM with 18 layers and a 4-gram LM.
Semi-supervised learning on unlabeled data is also vital. In task3, we employ
the ASR system to improve the visual system; some false subtitles can be
corrected by a fusion module.
☆ Robust Contrastive Language-Image Pretraining against Adversarial Attacks
Contrastive vision-language representation learning has achieved
state-of-the-art performance for zero-shot classification, by learning from
millions of image-caption pairs crawled from the internet. However, the massive
data that powers large multimodal models such as CLIP makes them extremely
vulnerable to various types of adversarial attacks, including targeted and
backdoor data poisoning attacks. Despite this vulnerability, robust contrastive
vision-language pretraining against adversarial attacks has remained
unaddressed. In this work, we propose RoCLIP, the first effective method for
robust pretraining and fine-tuning of multimodal vision-language models. RoCLIP
effectively breaks the association between poisoned image-caption pairs by
considering a pool of random examples, and (1) matching every image with the
text that is most similar to its caption in the pool, and (2) matching every
caption with the image that is most similar to its image in the pool. Our
extensive experiments show that our method renders state-of-the-art targeted
data poisoning and backdoor attacks ineffective during pre-training or
fine-tuning of CLIP. In particular, RoCLIP decreases the poison and backdoor
attack success rates to 0% during pre-training and 1%-4% during fine-tuning,
and effectively improves the model's performance.
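The pool-matching steps (1) and (2) can be sketched with cosine-similarity nearest neighbors. This is a simplified sketch (the toy one-hot "embeddings" and pool construction are assumptions; the real method operates on learned CLIP embeddings with a maintained example pool):

```python
import numpy as np

def nearest(query, pool):
    # Index of the cosine-similarity nearest neighbor of `query`
    # among the rows of `pool`.
    sims = pool @ query / (np.linalg.norm(pool, axis=1)
                           * np.linalg.norm(query) + 1e-9)
    return int(np.argmax(sims))

def roclip_match(image_embs, caption_embs, pool_caps, pool_imgs):
    # RoCLIP-style matching: pair each image with the pool caption most
    # similar to its own caption, and each caption with the pool image
    # most similar to its own image, breaking poisoned pair associations.
    cap_idx = [nearest(c, pool_caps) for c in caption_embs]
    img_idx = [nearest(v, pool_imgs) for v in image_embs]
    return img_idx, cap_idx

pool_caps = np.eye(3)               # toy caption pool (3 entries)
pool_imgs = np.eye(3)               # toy image pool
caps = np.array([[0.9, 0.1, 0.0]])  # caption closest to pool caption 0
imgs = np.array([[0.0, 0.2, 0.8]])  # image closest to pool image 2
print(roclip_match(imgs, caps, pool_caps, pool_imgs))  # ([2], [0])
```

Because a poisoned caption is unlikely to be the nearest pool neighbor of its paired image's true content, the poisoned association is never directly reinforced during contrastive training.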
☆ Learning Transductions and Alignments with RNN Seq2seq Models
The paper studies the capabilities of Recurrent Neural Network
sequence-to-sequence (RNN seq2seq) models in learning four string-to-string transduction
tasks: identity, reversal, total reduplication, and input-specified
reduplication. These transductions are traditionally well studied under finite
state transducers and attributed with varying complexity. We find that RNN
seq2seq models are only able to approximate a mapping that fits the training or
in-distribution data. Attention helps significantly, but does not solve the
out-of-distribution generalization limitation. Task complexity and RNN variants
also play a role in the results. Our results are best understood in terms of
the complexity hierarchy of formal languages as opposed to that of string
transductions.
comment: 24 pages; 9 figures; 7 tables
☆ Neural Diarization with Non-autoregressive Intermediate Attractors ICASSP 2023
End-to-end neural diarization (EEND) with encoder-decoder-based attractors
(EDA) is a promising method to handle the whole speaker diarization problem
simultaneously with a single neural network. While the EEND model can produce
all frame-level speaker labels simultaneously, it disregards output label
dependency. In this work, we propose a novel EEND model that introduces the
label dependency between frames. The proposed method generates
non-autoregressive intermediate attractors to produce speaker labels at the
lower layers and conditions the subsequent layers with these labels. While the
proposed model works in a non-autoregressive manner, the speaker labels are
refined by referring to the whole sequence of intermediate labels. The
experiments with the two-speaker CALLHOME dataset show that the intermediate
labels with the proposed non-autoregressive intermediate attractors boost the
diarization performance. The proposed method with the deeper network benefits
more from the intermediate labels, resulting in better performance and training
throughput than EEND-EDA.
comment: ICASSP 2023
☆ Beyond Single Items: Exploring User Preferences in Item Sets with the Conversational Playlist Curation Dataset
Users in consumption domains, like music, are often able to more efficiently
provide preferences over a set of items (e.g. a playlist or radio) than over
single items (e.g. songs). Unfortunately, this is an underexplored area of
research, with most existing recommendation systems limited to understanding
preferences over single items. Curating an item set exponentiates the search
space that recommender systems must consider (all subsets of items!): this
motivates conversational approaches, where users explicitly state or refine
their preferences and systems elicit preferences in natural language, as an
efficient way to understand user needs. We call this task conversational item
set curation and present a novel data collection methodology that efficiently
collects realistic preferences about item sets in a conversational setting by
observing both item-level and set-level feedback. We apply this methodology to
music recommendation to build the Conversational Playlist Curation Dataset
(CPCD), where we show that it leads raters to express preferences that would
not be otherwise expressed. Finally, we propose a wide range of conversational
retrieval models as baselines for this task and evaluate them on the dataset.
♻ ☆ Unifying Vision, Text, and Layout for Universal Document Processing CVPR 2023
Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal
We propose Universal Document Processing (UDOP), a foundation Document AI
model which unifies text, image, and layout modalities together with varied
task formats, including document understanding and generation. UDOP leverages
the spatial correlation between textual content and document image to model
image, text, and layout modalities with one uniform representation. With a
novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain
downstream tasks into a prompt-based sequence generation scheme. UDOP is
pretrained on both large-scale unlabeled document corpora using innovative
self-supervised objectives and diverse labeled data. UDOP also learns to
generate document images from text and layout modalities via masked image
reconstruction. To the best of our knowledge, this is the first time in the
field of document AI that one model simultaneously achieves high-quality neural
document editing and content customization. Our method sets the
state-of-the-art on 8 Document AI tasks, e.g., document understanding and QA,
across diverse data domains like finance reports, academic papers, and
websites. UDOP ranks first on the leaderboard of the Document Understanding
Benchmark.
comment: CVPR 2023
♻ ☆ Clinical BERTScore: An Improved Measure of Automatic Speech Recognition Performance in Clinical Settings
Automatic Speech Recognition (ASR) in medical contexts has the potential to
save time, cut costs, increase report accuracy, and reduce physician burnout.
However, the healthcare industry has been slower to adopt this technology, in
part due to the importance of avoiding medically-relevant transcription
mistakes. In this work, we present the Clinical BERTScore (CBERTScore), an ASR
metric that penalizes clinically-relevant mistakes more than others. We
demonstrate that this metric more closely aligns with clinician preferences on
medical sentences as compared to other metrics (WER, BLEU, METEOR, etc.),
sometimes by wide margins. We collect a benchmark of 13 clinician preferences
on 149 realistic medical sentences called the Clinician Transcript Preference
benchmark (CTP), demonstrate that CBERTScore more closely matches what
clinicians prefer, and release the benchmark for the community to further
develop clinically-aware ASR metrics.
♻ ☆ AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models EACL 2023
Pretrained language models (PLMs) are trained on massive corpora, but often
need to specialize to specific domains. A parameter-efficient adaptation method
suggests training an adapter for each domain on the task of language modeling.
This leads to good in-domain scores but can be impractical for domain- or
resource-restricted settings. A solution is to use a related-domain adapter for
the novel domain at test time. In this paper, we introduce AdapterSoup, an
approach that performs weight-space averaging of adapters trained on different
domains. Our approach is embarrassingly parallel: first, we train a set of
domain-specific adapters; then, for each novel domain, we determine which
adapters should be averaged at test time. We present extensive experiments
showing that AdapterSoup consistently improves performance on new domains
without extra training. We also explore weight averaging of adapters trained on
the same domain with different hyper-parameters, and show that it preserves the
performance of a PLM on new domains while obtaining strong in-domain results.
We explore various approaches for choosing which adapters to combine, such as
text clustering and semantic similarity. We find that using clustering leads to
the most competitive results on novel domains.
comment: Accepted at EACL 2023; camera-ready version
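The core operation, weight-space averaging of selected adapters, is an element-wise mean over their parameters. A minimal sketch with uniform weights (the scalar "parameters" and key names below are toy stand-ins for real adapter tensors):

```python
def average_adapters(state_dicts, weights=None):
    # Weight-space average of adapter parameters: weighted element-wise
    # mean of each parameter across the selected domain adapters.
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    keys = state_dicts[0].keys()
    return {k: sum(w * sd[k] for w, sd in zip(weights, state_dicts))
            for k in keys}

# Toy adapters with scalar "parameters":
a1 = {"down.weight": 1.0, "up.weight": 4.0}
a2 = {"down.weight": 3.0, "up.weight": 0.0}
soup = average_adapters([a1, a2])
print(soup)   # {'down.weight': 2.0, 'up.weight': 2.0}
```

Since averaging happens purely in weight space, selecting and combining adapters for a novel domain at test time requires no gradient updates, which is what makes the approach training-free and embarrassingly parallel.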
♻ ☆ Attribution and Obfuscation of Neural Text Authorship: A Data Mining Perspective KDD
Two interlocking research questions of growing interest and importance in
privacy research are Authorship Attribution (AA) and Authorship Obfuscation
(AO). Given an artifact, especially a text t in question, an AA solution aims
to accurately attribute t to its true author out of many candidate authors
while an AO solution aims to modify t to hide its true authorship.
Traditionally, the notion of authorship and its accompanying privacy concerns
have applied only to human authors. However, in recent years, due to the explosive
advancements in Neural Text Generation (NTG) techniques in NLP, capable of
synthesizing human-quality open-ended texts (so-called "neural texts"), one has
to now consider authorships by humans, machines, or their combination. Due to
the implications and potential threats of neural texts when used maliciously,
it has become critical to understand the limitations of traditional AA/AO
solutions and develop novel AA/AO solutions in dealing with neural texts. In
this survey, therefore, we make a comprehensive review of recent literature on
the attribution and obfuscation of neural text authorship from a Data Mining
perspective, and share our view on their limitations and promising research
directions.
comment: Accepted at ACM SIGKDD Explorations, Vol. 25, June 2023
♻ ☆ BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop, :, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, Dragomir Radev, Eduardo González Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Bar Natan, Francesco De Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jörg Frohberg, Joseph Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro Von Werra, Leon Weber, Long Phan, Loubna Ben allal, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario Šaško, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad A. 
Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto Luis López, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan Muhammad, Shanya Sharma, Shayne Longpre, Somaieh Nikpoor, Stanislav Silberberg, Suhas Pai, Sydney Zink, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Talat, Arun Raja, Benjamin Heinzerling, Chenglei Si, Davut Emre Taşar, Elizabeth Salesky, Sabrina J. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jos Rozen, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-shaibani, Matteo Manica, Nihal Nayak, Ryan Teehan, Samuel Albanie, Sheng Shen, Srulik Ben-David, Stephen H. 
Bach, Taewoon Kim, Tali Bers, Thibault Fevry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xiangru Tang, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Yallow Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung, Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Nouamane Tazi, Omar Sanseviero, Patrick von Platen, Pierre Cornette, Pierre François Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Dan Garrette, Deepak Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, Genta Indra Winata, Hailey Schoelkopf, Jan-Christoph Kalo, Jekaterina Novikova, Jessica Zosa Forde, Jordan Clive, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar van der Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, Shachar Mirkin, Shani Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner, Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ana Santos, Anthony Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh HajiHosseini, Bahareh Behroozi, Benjamin Ajibade, Bharat Saxena, Carlos Muñoz Ferrandis, Danish Contractor, David Lansky, Davis David, Douwe Kiela, Duong A. 
Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatima Mirza, Frankline Ononiwu, Habib Rezanejad, Hessie Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jesse Passmore, Josh Seltzer, Julio Bonis Sanz, Livia Dutra, Mairon Samagaio, Maraim Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael McKenna, Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott, Nour Fahmy, Olanrewaju Samuel, Ran An, Rasmus Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Silas Wang, Sourav Roy, Sylvain Viguier, Thanh Le, Tobi Oyebade, Trieu Le, Yoyo Yang, Zach Nguyen, Abhinav Ramesh Kashyap, Alfredo Palasciano, Alison Callahan, Anima Shukla, Antonio Miranda-Escalada, Ayush Singh, Benjamin Beilharz, Bo Wang, Caio Brito, Chenxi Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, Daniel León Periñán, Daniel Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec, Imane Bello, Ishani Dash, Jihyun Kang, John Giorgi, Jonas Golde, Jose David Posada, Karthik Rangasai Sivaraman, Lokesh Bulchandani, Lu Liu, Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, Maria A Castillo, Marianna Nezhurina, Mario Sänger, Matthias Samwald, Michael Cullan, Michael Weinberg, Michiel De Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner, Pascale Fung, Patrick Haller, Ramya Chandrasekhar, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shlok S Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-aroonsiri, Srishti Kumar, Stefan Schweter, Sushil Bharati, Tanmay Laud, Théo Gigant, Tomoya Kainuma, Wojciech Kusa, Yanis Labrak, Yash Shailesh Bajaj, Yash Venkatraman, Yifan Xu, Yingxin Xu, Yu Xu, Zhe Tan, Zhongli Xie, Zifan Ye, Mathilde Bras, Younes Belkada, Thomas Wolf
Large language models (LLMs) have been shown to be able to perform new tasks
based on a few demonstrations or natural language instructions. While these
capabilities have led to widespread adoption, most LLMs are developed by
resource-rich organizations and are frequently kept from the public. As a step
towards democratizing this powerful technology, we present BLOOM, a
176B-parameter open-access language model designed and built thanks to a
collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer
language model that was trained on the ROOTS corpus, a dataset comprising
hundreds of sources in 46 natural and 13 programming languages (59 in total).
We find that BLOOM achieves competitive performance on a wide variety of
benchmarks, with stronger results after undergoing multitask prompted
finetuning. To facilitate future research and applications using LLMs, we
publicly release our models and code under the Responsible AI License.
♻ ☆ Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation
The neural transducer is an end-to-end model for automatic speech recognition
(ASR). While the model is well-suited for streaming ASR, the training process
remains challenging. During training, the memory requirements may quickly
exceed the capacity of state-of-the-art GPUs, limiting batch size and sequence
lengths. In this work, we analyze the time and space complexity of a typical
transducer training setup. We propose a memory-efficient training method that
computes the transducer loss and gradients sample by sample. We present
optimizations to increase the efficiency and parallelism of the sample-wise
method. In a set of thorough benchmarks, we show that our sample-wise method
significantly reduces memory usage, and performs at competitive speed when
compared to the default batched computation. As a highlight, we manage to
compute the transducer loss and gradients for a batch size of 1024, and audio
length of 40 seconds, using only 6 GB of memory.
comment: 5 pages, 4 figures, 1 table, 1 algorithm
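The core idea of sample-wise computation can be illustrated with a toy loss standing in for the actual transducer loss (the function names and numpy setup below are illustrative, not the paper's implementation): summing per-sample losses one at a time yields the same total as the batched computation, while only ever materializing buffers for a single sample.

```python
import numpy as np

def toy_loss(logits, target):
    # Toy stand-in for one sample's transducer loss: negative
    # log-probability of the target class under a softmax.
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[target])

def batched_loss(batch_logits, targets):
    # Conventional approach: materialize activations for the whole
    # batch at once (peak memory grows with batch size).
    probs = np.exp(batch_logits) / np.exp(batch_logits).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(targets)), targets]).sum()

def samplewise_loss(batch_logits, targets):
    # Sample-wise approach: process one sample at a time and
    # accumulate, so peak memory is bounded by a single sample.
    total = 0.0
    for logits, target in zip(batch_logits, targets):
        total += toy_loss(logits, target)  # per-sample buffers freed here
    return total

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))
targets = [0, 2, 1, 4]
print(np.isclose(batched_loss(logits, targets), samplewise_loss(logits, targets)))
```

The same accumulation trick applies to gradients: because the total loss is a sum over samples, per-sample gradients can be summed without changing the result.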
♻ ☆ Temporal Sentence Grounding in Videos: A Survey and Future Directions
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video
localization (NLVL) or video moment retrieval (VMR), aims to retrieve a
temporal moment that semantically corresponds to a language query from an
untrimmed video. Connecting computer vision and natural language, TSGV has
drawn significant attention from researchers in both communities. This survey
attempts to provide a summary of fundamental concepts in TSGV and current
research status, as well as future research directions. As the background, we
present a common structure of functional components in TSGV, in a tutorial
style: from feature extraction from raw video and language query, to answer
prediction of the target moment. Then we review the techniques for multimodal
understanding and interaction, which are the key focus of TSGV for effective
alignment between the two modalities. We construct a taxonomy of TSGV
techniques and elaborate the methods in different categories with their
strengths and weaknesses. Lastly, we discuss issues with the current TSGV
research and share our insights about promising research directions.
comment: Accepted by IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI)
♻ ☆ Accidental Learners: Spoken Language Identification in Multilingual Self-Supervised Models ICASSP 2023
In this paper, we extend previous self-supervised approaches for language
identification by experimenting with Conformer based architecture in a
multilingual pre-training paradigm. We find that pre-trained speech models
optimally encode language discriminatory information in lower layers. Further,
we demonstrate that the embeddings obtained from these layers are remarkably
robust, classifying unseen languages across different acoustic environments without
additional training. After fine-tuning a pre-trained Conformer model on the
VoxLingua107 dataset, we achieve results similar to current state-of-the-art
systems for language identification. Moreover, our model accomplishes this with
5x fewer parameters. We open-source the model through the NVIDIA NeMo toolkit.
comment: Submitted to ICASSP 2023
♻ ☆ EDU-level Extractive Summarization with Varying Summary Lengths EACL 2023
Extractive models usually formulate text summarization as extracting fixed
top-$k$ salient sentences from the document as a summary. Few works exploited
extracting finer-grained Elementary Discourse Unit (EDU) with little analysis
and justification for the extractive unit selection. Further, the selection
strategy of the fixed top-$k$ salient sentences fits the summarization need
poorly, as the number of salient sentences in different documents varies and
therefore a common or best $k$ does not exist in reality. To fill these gaps,
this paper first conducts the comparison analysis of oracle summaries based on
EDUs and sentences, which provides evidence from both theoretical and
experimental perspectives to justify and quantify that EDUs make summaries with
higher automatic evaluation scores than sentences. Then, considering this merit
of EDUs, this paper further proposes an EDU-level extractive model with Varying
summary Lengths (EDU-VL) and develops the corresponding learning algorithm. EDU-VL
learns to encode and predict probabilities of EDUs in the document, generate
multiple candidate summaries with varying lengths based on various $k$ values,
and encode and score candidate summaries, in an end-to-end training manner.
Finally, EDU-VL is experimented on single and multi-document benchmark datasets
and shows improved performances on ROUGE scores in comparison with
state-of-the-art extractive models, and further human evaluation suggests that
EDU-constituent summaries maintain good grammaticality and readability.
comment: Accepted to EACL 2023 Findings
♻ ☆ EasyNLP: A Comprehensive and Easy-to-use Toolkit for Natural Language Processing
Chengyu Wang, Minghui Qiu, Chen Shi, Taolin Zhang, Tingting Liu, Lei Li, Jianing Wang, Ming Wang, Jun Huang, Wei Lin
The success of Pre-Trained Models (PTMs) has reshaped the development of
Natural Language Processing (NLP). Yet, it is not easy to obtain
high-performing models and deploy them online for industrial practitioners. To
bridge this gap, EasyNLP is designed to make it easy to build NLP applications,
which supports a comprehensive suite of NLP algorithms. It further features
knowledge-enhanced pre-training, knowledge distillation and few-shot learning
functionalities for large-scale PTMs, and provides a unified framework of model
training, inference and deployment for real-world applications. Currently,
EasyNLP has powered over ten business units within Alibaba Group and is
seamlessly integrated into the Platform of AI (PAI) products on Alibaba Cloud.
The source code of our EasyNLP toolkit is released at GitHub
(https://github.com/alibaba/EasyNLP).
comment: 8 pages
♻ ☆ Rethinking the Reasonability of the Test Set for Simultaneous Machine Translation ICASSP 2023
Simultaneous machine translation (SimulMT) models start translation before
the end of the source sentence, making the translation monotonically aligned
with the source sentence. However, the general full-sentence translation test
set is acquired by offline translation of the entire source sentence, which is
not designed for SimulMT evaluation, prompting us to rethink whether it
underestimates the performance of SimulMT models. In this paper, we manually
annotate a monotonic test set based on the MuST-C English-Chinese test set,
denoted as SiMuST-C. Our human evaluation confirms the acceptability of our
annotated test set. Evaluations on three different SimulMT models verify that
the underestimation problem can be alleviated on our test set. Further
experiments show that finetuning on an automatically extracted monotonic
training set improves SimulMT models by up to 3 BLEU points.
comment: Accepted by 48th IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP 2023)
♻ ☆ NASTyLinker: NIL-Aware Scalable Transformer-based Entity Linker ESWC'23
Entity Linking (EL) is the task of detecting mentions of entities in text and
disambiguating them to a reference knowledge base. Most prevalent EL approaches
assume that the reference knowledge base is complete. In practice, however, it
is necessary to deal with the case of linking to an entity that is not
contained in the knowledge base (NIL entity). Recent works have shown that,
instead of focusing only on affinities between mentions and entities,
inter-mention affinities can be used to represent NIL entities as
clusters of mentions. At the same time, inter-mention affinities can
help to substantially improve linking performance for known entities. With
NASTyLinker, we introduce an EL approach that is aware of NIL entities and
produces corresponding mention clusters while maintaining high linking
performance for known entities. The approach clusters mentions and entities
based on dense representations from Transformers and resolves conflicts (if
more than one entity is assigned to a cluster) by computing transitive
mention-entity affinities. We show the effectiveness and scalability of
NASTyLinker on NILK, a dataset that is explicitly constructed to evaluate EL
with respect to NIL entities. Further, we apply the presented approach to an
actual EL task, namely to knowledge graph population by linking entities in
Wikipedia listings, and provide an analysis of the outcome.
comment: Preprint of a paper in the research track of the 20th Extended
Semantic Web Conference (ESWC'23)
♻ ☆ Adaptive Machine Translation with Large Language Models
Consistency is a key requirement of high-quality translation. It is
especially important to adhere to pre-approved terminology and adapt to
corrected translations in domain-specific projects. Machine translation (MT)
has achieved significant progress in the area of domain adaptation. However,
real-time adaptation remains challenging. Large-scale language models (LLMs)
have recently shown interesting capabilities of in-context learning, where they
learn to replicate certain input-output text generation patterns, without
further fine-tuning. By feeding an LLM at inference time with a prompt that
consists of a list of translation pairs, it can then simulate the domain and
style characteristics. This work aims to investigate how we can utilize
in-context learning to improve real-time adaptive MT. Our extensive experiments
show promising results at translation time. For example, GPT-3.5 can adapt to a
set of in-domain sentence pairs and/or terminology while translating a new
sentence. We observe that the translation quality with few-shot in-context
learning can surpass that of strong encoder-decoder MT systems, especially for
high-resource languages. Moreover, we investigate whether we can combine MT
from strong encoder-decoder models with fuzzy matches, which can further
improve translation quality, especially for less supported languages. We
conduct our experiments across five diverse language pairs, namely
English-to-Arabic (EN-AR), English-to-Chinese (EN-ZH), English-to-French
(EN-FR), English-to-Kinyarwanda (EN-RW), and English-to-Spanish (EN-ES).
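Assembling such an in-context prompt from translation pairs can be sketched as follows (the template format and function name are hypothetical illustrations, not the paper's exact prompt):

```python
def build_adaptive_prompt(pairs, source_sentence, src="English", tgt="French"):
    """Assemble a few-shot prompt from previously approved translation
    pairs so the model can imitate their domain style and terminology
    when translating a new sentence."""
    lines = []
    for s, t in pairs:
        lines.append(f"{src}: {s}")
        lines.append(f"{tgt}: {t}")
    # The new source sentence, with the target side left open
    # for the model to complete.
    lines.append(f"{src}: {source_sentence}")
    lines.append(f"{tgt}:")
    return "\n".join(lines)

prompt = build_adaptive_prompt(
    [("The contract is binding.", "Le contrat est contraignant.")],
    "The contract was signed today.",
)
print(prompt)
```

Fuzzy matches from a translation memory would simply be selected as the `pairs` most similar to the new source sentence before the prompt is built.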
♻ ☆ I-Tuning: Tuning Frozen Language Models with Image for Lightweight Image Captioning ICASSP 2023
Image Captioning is a traditional vision-and-language task that aims to
generate the language description of an image. Recent studies focus on scaling
up the model size and the number of training data, which significantly increase
the cost of model training. In contrast to these costly models, we introduce
a lightweight image captioning framework (I-Tuning), which contains a small
number of trainable parameters. We design a novel I-Tuning cross-attention
module to connect the non-trainable pre-trained language decoder GPT2 and
vision encoder CLIP-ViT. Since most parameters are not required to be updated
during training, our framework is lightweight and fast. Experimental results
conducted on three image captioning benchmarks reveal that our framework
achieves comparable or better performance than the large-scale baseline
systems. Moreover, our models contain up to 10 times fewer trainable parameters
and require much less training data compared with state-of-the-art baselines.
comment: ICASSP 2023
♻ ☆ DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech ICASSP 2023
The majority of current Text-to-Speech (TTS) datasets, which are collections
of individual utterances, contain few conversational aspects. In this paper, we
introduce DailyTalk, a high-quality conversational speech dataset designed for
conversational TTS. We sampled, modified, and recorded 2,541 dialogues from the
open-domain dialogue dataset DailyDialog, inheriting its annotated attributes.
On top of our dataset, we extend prior work as our baseline, where a
non-autoregressive TTS is conditioned on historical information in a dialogue.
From the baseline experiment with both general and our novel metrics, we show
that DailyTalk can be used as a general TTS dataset and, beyond that, that our
baseline can represent contextual information from DailyTalk. The DailyTalk
dataset and baseline code are freely available for academic use with CC-BY-SA
4.0 license.
comment: 5 pages, 1 figure, 4 tables. Accepted to ICASSP 2023
♻ ☆ Alternate Intermediate Conditioning with Syllable-level and Character-level Targets for Japanese ASR
End-to-end automatic speech recognition directly maps input speech to
characters. However, the mapping can be problematic when several different
pronunciations should be mapped into one character or when one pronunciation is
shared among many different characters. Japanese ASR suffers the most from such
many-to-one and one-to-many mapping problems due to Japanese kanji characters.
To alleviate the problems, we introduce explicit interaction between characters
and syllables using Self-conditioned connectionist temporal classification
(CTC), in which the upper layers are ``self-conditioned'' on the intermediate
predictions from the lower layers. The proposed method utilizes character-level
and syllable-level intermediate predictions as conditioning features to deal
with mutual dependency between characters and syllables. Experimental results
on the Corpus of Spontaneous Japanese show that the proposed method outperformed
the conventional multi-task and Self-conditioned CTC methods.
comment: SLT 2022
♻ ☆ Self-Attention Networks Can Process Bounded Hierarchical Languages ACL 2021
Despite their impressive performance in NLP, self-attention networks were
recently proved to be limited for processing formal languages with hierarchical
structure, such as $\mathsf{Dyck}_k$, the language consisting of well-nested
parentheses of $k$ types. This suggested that natural language can be
approximated well with models that are too weak for formal languages, or that
the role of hierarchy and recursion in natural language might be limited. We
qualify this implication by proving that self-attention networks can process
$\mathsf{Dyck}_{k, D}$, the subset of $\mathsf{Dyck}_{k}$ with depth bounded by
$D$, which arguably better captures the bounded hierarchical structure of
natural language. Specifically, we construct a hard-attention network with
$D+1$ layers and $O(\log k)$ memory size (per token per layer) that recognizes
$\mathsf{Dyck}_{k, D}$, and a soft-attention network with two layers and
$O(\log k)$ memory size that generates $\mathsf{Dyck}_{k, D}$. Experiments show
that self-attention networks trained on $\mathsf{Dyck}_{k, D}$ generalize to
longer inputs with near-perfect accuracy, and also verify the theoretical
memory advantage of self-attention networks over recurrent networks.
comment: ACL 2021. 19 pages with extended appendix. Fixed a small typo in the
formula at the end of page 5 (thanks to Gabriel Faria). Code:
https://github.com/princeton-nlp/dyck-transformer
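As a minimal illustration of the language itself (not of the paper's attention constructions), membership in Dyck_{k,D} can be checked with a stack whose height is capped at the depth bound D:

```python
def is_dyck_k_d(s, pairs, depth_bound):
    """Check membership in Dyck_{k,D}: well-nested strings over k
    bracket types whose nesting depth never exceeds depth_bound.
    `pairs` maps each opening bracket to its closing bracket."""
    closers = {close: open_ for open_, close in pairs.items()}
    stack = []
    for ch in s:
        if ch in pairs:                   # opening bracket: push
            stack.append(ch)
            if len(stack) > depth_bound:  # depth bound D exceeded
                return False
        elif ch in closers:               # closing bracket: must match top
            if not stack or stack[-1] != closers[ch]:
                return False
            stack.pop()
        else:
            return False                  # symbol outside the alphabet
    return not stack                      # all brackets matched

PAIRS = {"(": ")", "[": "]"}              # k = 2 bracket types
print(is_dyck_k_d("([()])", PAIRS, 3))    # nesting depth 3, within bound
print(is_dyck_k_d("((()))", PAIRS, 2))    # depth 3 exceeds bound 2
```

The bounded stack height is what makes the language tractable for fixed-depth self-attention: the recognizer never needs more than D cells of memory at once.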